common/parser: add proper reasoning tag prefill reading #20424
pwilkin merged 14 commits into ggml-org:master from
Conversation
|
Dumb question, why not find the start of the assistant message and prepend that? I agree it would be easier to parse if we had a "prefill" of some sort that normalizes the input, such that we can handle the logic in the grammar and not through flags. However, if we're going this route I would look into prepending the start of the entire assistant message. This will also open the door for parsing output from requests with an assistant prefill. |
|
Yeah, that would be the logical conclusion, but for now it's easier for me just to extract the reasoning markers since finding the actual start of the assistant message is nontrivial. |
|
Qwen3.5 uses a different tag, however. It probably doesn't matter for this model, but it is technically not adhering to the template. |
Maybe set |
Run the template once with |
|
That usually works, yeah 😀 I can try that and see what the results are (this is what |
|
Nice patch! With model https://huggingface.co/mradermacher/Qwen3.5-40B-Claude-4.5-Opus-High-Reasoning-Thinking-GGUF the patches fix the webui getting confused on /think and failing to split the reasoning and generation parts correctly. Build llama.cpp-cuda-git-b8334.r9.710878a7dd-1. |
Force-pushed 3bfb08f to 4083259
|
@aldehir changed the prefill extraction behavior to the differential one you mentioned. |
common/chat.h
Outdated
```cpp
std::string grammar;
bool grammar_lazy = false;
bool thinking_forced_open = false;
std::string prefill;
```
Think we should name this `generation_prompt`? It lines up with the `add_generation_prompt` flag.
```cpp
bool clear_reasoning_start = false;
if (inputs.reasoning_format != COMMON_REASONING_FORMAT_NONE &&
    autoparser.reasoning.mode != reasoning_mode::NONE &&
    !autoparser.reasoning.end.empty()) {
    const auto & r_start = autoparser.reasoning.start;
    const auto & r_end   = autoparser.reasoning.end;
    auto r_end_t   = trim_trailing_whitespace(r_end);
    auto r_start_t = trim_trailing_whitespace(r_start);

    if (!r_start_t.empty()) {
        auto start_pos = prompt_to_search.rfind(r_start_t);
        if (start_pos != std::string::npos) {
            std::string from_start = prompt_to_search.substr(start_pos);
            auto fs_trimmed = trim_trailing_whitespace(from_start);

            if (string_ends_with(fs_trimmed, r_end_t)) {
                data.prefill = r_start + r_end;
            } else if (string_ends_with(fs_trimmed, r_start_t)) {
                data.prefill = from_start;
            } else {
                clear_reasoning_start = true;
            }
        }
    }
}
```
So my understanding is: we have a generation prompt G, and we can create a parser that accepts `G[0:min(G.size(), G.index_of(reasoning_start))] + (reasoning_start + reasoning + reasoning_end)? + ...`. Then we can do away with all the trim logic.
The benefit is that now the parser can properly parse assistant prefill from the user, since the parser starts from the beginning of the assistant message.
I see that Mistral's templates have no generation prompt, so G = "". But this is fine, because the model emits the [THINK] tag. So the above still works.
This workaround is mostly for Apriel that has a delimited thinking format and inserts a header "Thinking chain starts here: " or something like that as the generation prompt which acts as a quasi-reasoning marker that we want to strip.
|
@aldehir okay, that rewrite ended up being a bit bigger than I expected... but it's exactly the algorithm you mentioned now. |
|
Oh jeez, well it's <100 net LOC. I'll give it a whirl. |
|
@aldehir happy to report I added another nice piece of code to make it work correctly with grammars / schemas :) |
Force-pushed a75c1f8 to 72b9fa3
|
With OpenCode and Unsloth's Qwen3.5 35B, I seem to be getting all think blocks in the response now with
This is with "show thinking" off. Wasn't happening yesterday, so I'm guessing it's related? |
|
@strawberrymelonpanda argh. Can you check with vanilla template? |
|
Willing to, is there a command I can use to bypass unsloth's or do I need a different model? |
|
@pwilkin looks like it's happening on Ubergarm's quant too.
|
|
@strawberrymelonpanda can you give the exact server command? I'm trying to repro on various Qwen3.5 4B quants but all seems correct so far. |
(Tested with both models.) Start opencode in the llama.cpp folder, at commit c1b9116 (master). I ran the command

(I was seeing if it could, for local use, because I'd like this without a full unload; I think there are some differences?) For this command specifically, I first see thinking work twice, then start failing. I can't promise it's related, but my OpenCode is set to manual update and hasn't changed. After rolling back, the same commands on c125883 work, and it continues throughout the task without issue. |
…on) (#20777)

* chat : fix out_of_range crash in throw path (#20424 regression)

  #20424 introduced effective_input = generation_prompt + input, but the throw path uses input.substr(result.end) where result.end is a position within effective_input. Every thinking model with a non-empty generation_prompt crashes with std::out_of_range instead of the intended error message.

  Test crashes on unpatched master, passes with fix:

      cmake -B build -DLLAMA_BUILD_TESTS=ON -DLLAMA_BUILD_TOOLS=OFF
      cmake --build build --target test-chat
      ./build/bin/test-chat

* Update test-chat.cpp
* Update test-chat.cpp
* Update test-chat.cpp

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
|
@pwilkin I'm using a deterministic seed, --seed 1, so with any luck maybe you can reproduce it. Not sure how much the RNG varies based on hardware though. |
|
Can you please try without the speculative decoding as well? (gtg to sleep now but will try to repro tomorrow) |
|
Same results without it. For that command: works twice, fails twice, etc. |
|
If you're not familiar with OpenCode, here's an opencode.json
You can remove this line, but as of recent commits it'll cause the large model to try to make the title instead, which can be a significant delay, so I keep it for a fast fail. |
|
@strawberrymelonpanda I test everything on OpenCode, which is why I'm surprised to see this arise. I'll try to repro on the smaller models first, but I have a 35B quant lying around somewhere if I can't. |
|
@strawberrymelonpanda got a repro, looking into it. |
|
Aaaaand it's gone... reproduced once, set up an MITM proxy and now it's gone :P |
* Implement proper prefill extraction
* Refactor cli parameters, update docs, move reasoning budget sampler part to common/reasoning-budget.cpp
* Update tools/server/server-task.cpp
* refactor: move grammars to variant, remove grammar_external, handle exception internally
* Make code less C++y

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
@strawberrymelonpanda happy to report I found the cause :) Can you please check if #20825 resolves it? |










This changes the erroneous behavior of the autoparser that ascribed thinking behavior to templates. As people rightly mentioned, some models have dynamic or hybrid reasoning: they can reason or not depending on certain switches, and even the template behavior can change as a result (e.g. inserting `<think>` in the assistant prefill after a "no_think" appears in a user message).

Therefore, the `FORCED_OPEN` and `FORCED_CLOSED` formats are gone (`DELIMITER` is also deleted, since it's just the special case where the opening marker is empty). The parser will now simply detect models with tagged reasoning, i.e. an opening and a closing reasoning marker. However, it will check the assistant prefill for those markers and append them to the input for the grammar and the parser, so that they are taken into account. This simplifies the parsing mechanism, since it no longer has to differentiate whether the `<think>` was added by the template or generated by the model.